Search for: All records

Creators/Authors contains: "Yang, Lishan"


  1. Free, publicly-accessible full text available June 28, 2026
  2. Convolutional neural networks (CNNs) are incorporated into many image-based tasks across a variety of domains, including safety-critical ones such as object classification/detection and lane detection for self-driving cars. These applications have strict safety requirements and must guarantee reliable operation of the neural networks in the presence of soft errors (i.e., transient faults) in DRAM. Standard safety mechanisms (e.g., triplication of data/computation) provide high resilience but introduce intolerable overhead. We perform a detailed characterization and propose an efficient methodology for pinpointing critical weights using an efficient proxy, the Taylor criterion (a hedged sketch of this ranking step appears after this list). Using this characterization, we design Aspis, an efficient software protection scheme that performs selective weight hardening and offers a performance/reliability tradeoff. Aspis provides higher resilience compared to state-of-the-art methods and is integrated into PyTorch as a fully automated library.
  3. Graphics Processing Units (GPUs) are widely deployed and utilized across various computing domains, including cloud and high-performance computing. Considering their extensive usage and increasing popularity, ensuring GPU reliability is crucial. Software-based reliability evaluation methodologies, though fast, often neglect the complex hardware details of modern GPU designs. This oversight can lead to misleading measurements and misguided decisions regarding protection strategies. This paper breaks new ground by conducting an in-depth examination of well-established vulnerability assessment methods for modern GPU architectures, from the microarchitecture all the way to the software layers. It highlights divergences between popular software-based vulnerability evaluation methods and the ground-truth cross-layer evaluation, which persist even under strong protections like triple modular redundancy. Accurate evaluation requires considering the fault distribution from hardware to software (a hedged sketch of the kind of software-level injection under scrutiny appears after this list). Our comprehensive measurements offer valuable insights into the accurate assessment of GPU reliability.
  4. Ceccarelli, Andrea; Trapp, Mario; Bondavalli, Andrea; Bitsch, Friedemann (Ed.)
    Simulation-based Fault Injection (FI) is highly recommended by functional safety standards in the automotive and aerospace domains, in order to "support the argumentation of completeness and correctness of a system architectural design with respect to faults" (ISO 26262). We argue that a library of failure models facilitates this process. Such a library, firstly, supports completeness claims through, e.g., an extensive and systematic collection process. Secondly, we argue that failure model specifications should be executable, so they can be implemented as FI operators within a simulation framework, and parametrizable, so they remain relevant and accurate for different systems. Given the distributed nature of automotive and aerospace development processes, we moreover argue that a data-flow-based definition allows failure models to be applied to black-box components (a hedged sketch of such operators appears after this list). Yet, existing sources for failure models provide fragmented, ambiguous, incomplete, and redundant information, often meeting none of these requirements. We therefore introduce a library of 18 executable and parametrizable failure models collected through a systematic literature survey focusing on automotive and aerospace Cyber-Physical Systems (CPS). To demonstrate the applicability to simulation-based FI, we implement and apply a selection of failure models to a real-world automotive CPS within a state-of-the-art simulation environment, and highlight their impact.
  5. Data center downtime typically centers on IT equipment failure, and storage devices are the most frequently failing components in data centers. We present a comparative study of the hard disk drives (HDDs) and solid-state drives (SSDs) that constitute the typical storage in data centers. Using six years of field data on 100,000 HDDs of different models from the same manufacturer from the Backblaze dataset, and six years of field data on 30,000 SSDs of three models from a Google data center, we characterize the workload conditions that lead to failures. We illustrate that the root failure causes differ from common expectations and remain difficult to discern. For HDDs, we observe that young and old drives do not present many differences in their failures; instead, failures can be distinguished by discriminating drives based on the time spent on head positioning. For SSDs, we observe high levels of infant mortality and characterize the differences between infant and non-infant failures. We develop several machine learning failure prediction models that prove surprisingly accurate, achieving high recall and low false positive rates (a hedged sketch of such a model appears after this list). These models are used beyond simple prediction, as they help us untangle the complex interaction of workload characteristics that lead to failures and identify failure root causes from monitored symptoms.
  6. Mourlas, Costas; Pacheco, Diego; Pandi, Catia (Ed.)
    We present an individual-centric agent-based model and a flexible tool, GeoSpread, for studying and predicting the spread of viruses and diseases in urban settings. Using COVID-19 data collected by the Korea Centers for Disease Control and Prevention (KCDC), we analyze patient and route data of infected people from January 20, 2020, to May 31, 2020, and discover how infection clusters develop as a function of time. This analysis offers a statistical characterization of population mobility and is used to parameterize GeoSpread to capture the spread of the disease (a hedged sketch of the agent-based mechanism appears after this list). We validate simulation predictions from GeoSpread against ground truth, and we evaluate different what-if countermeasure scenarios to illustrate the usefulness and flexibility of the tool for epidemic modeling.
  7.
    As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Instead, application resilience is evaluated via extensive fault injection campaigns based on sampling of an extensive fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show how analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy and dramatically reduce the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size (a hedged sketch of this extrapolation idea appears after this list). Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times, and 97.0 times on average) while keeping estimation errors to less than 1%.
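For the Aspis abstract (item 2), here is a minimal sketch of ranking weights by the first-order Taylor criterion, |w * dL/dw|, in PyTorch. The model, dummy data, and the 1% hardening budget are illustrative assumptions, not the Aspis library's actual API.

```python
# A minimal sketch (not the Aspis API) of Taylor-criterion weight ranking:
# score each weight by |w * dL/dw| after one backward pass, then pick the
# most critical ones for selective hardening.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Flatten(),
                      nn.Linear(8 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 3, 32, 32)           # dummy batch, stand-in for real data
y = torch.randint(0, 10, (4,))
loss = criterion(model(x), y)
loss.backward()                          # populates .grad for every weight

scores = []
for name, p in model.named_parameters():
    if p.grad is None:
        continue
    # First-order Taylor importance: |w * dL/dw|, one score per weight.
    s = (p.detach() * p.grad).abs().flatten()
    scores.append((name, s))

# Harden only the top 1% most critical weights (hypothetical budget);
# hardening itself (e.g., replication/checksums) is out of scope here.
all_scores = torch.cat([s for _, s in scores])
k = max(1, int(0.01 * all_scores.numel()))
threshold = all_scores.topk(k).values.min()
critical = {name: (s >= threshold).nonzero().flatten() for name, s in scores}
```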
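For the GPU vulnerability-assessment abstract (item 3), here is a minimal sketch of the kind of architecture-level bit-flip injection whose accuracy the paper examines. The outcome classes and tolerance are illustrative; the sketch deliberately ignores the hardware-to-software fault distribution the paper argues is required for ground truth.

```python
# A minimal sketch of software-level fault injection: flip one random bit of
# a float32 value and classify the outcome. This models faults at the
# architectural level only, which is exactly the simplification the paper
# shows can mislead reliability estimates.
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 and return the corrupted value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return out

def classify(golden: float, faulty: float, tol: float = 1e-6) -> str:
    if faulty != faulty:                  # NaN: stand-in for a detected error
        return "DUE"
    return "masked" if abs(golden - faulty) <= tol else "SDC"

random.seed(0)
golden = 3.14159
tallies = {"masked": 0, "SDC": 0, "DUE": 0}
for _ in range(1000):
    faulty = flip_bit(golden, random.randrange(32))
    tallies[classify(golden, faulty)] += 1
print(tallies)   # a naive architecture-level estimate, not the ground truth
```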
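For the failure-model library abstract (item 4), here is a minimal sketch of executable, parametrizable, data-flow failure models. The operator names and the trace are hypothetical, not the paper's 18 surveyed models.

```python
# A minimal sketch of data-flow failure models: each model is a parameterized
# operator that rewrites a component's output stream, so it can be applied to
# black-box components inside a simulation framework.
from typing import Callable, List

FailureModel = Callable[[List[float]], List[float]]

def stuck_at(value: float, start: int) -> FailureModel:
    """From sample `start` on, the output is stuck at `value`."""
    return lambda sig: sig[:start] + [value] * (len(sig) - start)

def offset(bias: float) -> FailureModel:
    """A constant additive offset on every sample."""
    return lambda sig: [s + bias for s in sig]

def delay(steps: int) -> FailureModel:
    """Samples arrive `steps` ticks late; the earliest repeat sample 0."""
    return lambda sig: [sig[max(0, i - steps)] for i in range(len(sig))]

# Apply a parameterized model to a black-box sensor trace.
trace = [0.0, 0.1, 0.2, 0.3, 0.4]
fi_operator = stuck_at(value=0.2, start=2)
print(fi_operator(trace))   # [0.0, 0.1, 0.2, 0.2, 0.2]
```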
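For the HDD/SSD study (item 5), here is a minimal sketch of a failure-prediction model evaluated by recall and false-positive rate, the metrics the abstract emphasizes. The data are synthetic and the feature names hypothetical, not the Backblaze or Google telemetry.

```python
# A minimal sketch, on synthetic data, of drive-failure prediction: train a
# classifier on telemetry-like features and report recall and false-positive
# rate rather than accuracy alone.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
# Hypothetical features: [age_days, head_positioning_time, realloc_sectors]
X = rng.normal(size=(n, 3))
# Synthetic labels loosely tied to positioning time, echoing the HDD finding.
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print(f"recall = {tp / (tp + fn):.2f}, FPR = {fp / (fp + tn):.2f}")
# Feature importances hint at root-cause symptoms, as the study does.
print(dict(zip(["age", "positioning_time", "realloc_sectors"],
               clf.feature_importances_.round(2))))
```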
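For the GeoSpread abstract (item 6), here is a minimal sketch of an individual-centric agent-based spread mechanism: agents visit locations and infection spreads among co-located agents. All parameters are illustrative, not the KCDC-calibrated values used to parameterize the tool.

```python
# A minimal agent-based spread sketch: random daily mobility stands in for
# the route data; transmission occurs among agents sharing a location.
import random

random.seed(1)
N_AGENTS, N_LOCATIONS, DAYS = 500, 50, 60
P_TRANSMIT, RECOVERY_DAYS = 0.05, 14       # invented parameters

status = ["S"] * N_AGENTS                  # S/I/R per agent
days_infected = [0] * N_AGENTS
status[0] = "I"                            # one seed case

for day in range(DAYS):
    # Mobility: each agent visits one location per day.
    by_loc = {}
    for a in range(N_AGENTS):
        by_loc.setdefault(random.randrange(N_LOCATIONS), []).append(a)
    # Transmission among co-located agents.
    for agents in by_loc.values():
        if any(status[a] == "I" for a in agents):
            for a in agents:
                if status[a] == "S" and random.random() < P_TRANSMIT:
                    status[a] = "I"
    # Recovery after a fixed infectious period.
    for a in range(N_AGENTS):
        if status[a] == "I":
            days_infected[a] += 1
            if days_infected[a] >= RECOVERY_DAYS:
                status[a] = "R"
    if day % 10 == 0:
        print(day, "infected:", status.count("I"))
```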
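For the SUGAR abstract (item 7), here is a minimal sketch of the extrapolation idea: if per-thread dynamic instruction counts repeat in a pattern as the input grows, resilience measured on a small input predicts the large-input value. The pattern and masking probabilities below are invented for illustration.

```python
# A minimal sketch of input-sizing-based resilience estimation: resilience
# per instruction-count class is measured once (via fault injection) on a
# small input, then reweighted by the class mix of an arbitrarily large input.
from collections import Counter

def thread_inst_counts(input_size: int) -> list:
    """Stand-in for profiling: a repeating per-thread pattern of dynamic
    instruction counts (e.g., boundary threads do less work)."""
    pattern = [120, 120, 120, 80]          # hypothetical repeating tile
    return [pattern[i % len(pattern)] for i in range(input_size)]

# Masking probability per instruction-count class, as if measured by a fault
# injection campaign on the small input (values invented for the sketch).
masked_prob = {120: 0.91, 80: 0.97}

def estimate_resilience(input_size: int) -> float:
    """Weight each class's masking probability by its thread share."""
    counts = Counter(thread_inst_counts(input_size))
    total = sum(counts.values())
    return sum(masked_prob[c] * n / total for c, n in counts.items())

small = estimate_resilience(64)            # cheap campaign on a small input
large = estimate_resilience(1_000_000)     # predicted, no new injections
print(f"small-input estimate {small:.4f}, large-input prediction {large:.4f}")
```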